Fire up GraphLab Create



In [53]:

    
import graphlab
import graphlab as gl

Load some house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.



In [2]:

    
sales = graphlab.SFrame('home_data.gl/')









    



[INFO] This non-commercial license of GraphLab Create is assigned to iliassweb@gmail.com and will expire on September 22, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-15992 - Server binary: /home/zax/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443511855.log
[INFO] GraphLab Server Version: 1.6.1



In [3]:

    
sales









    Out[3]:





    
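+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900  | 3        | 1         | 1180        | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000  | 3        | 2.25      | 2570        | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000  | 2        | 1         | 770         | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000  | 4        | 3         | 1960        | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000  | 3        | 2         | 1680        | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4        | 4.5       | 5420        | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500  | 3        | 2.25      | 1715        | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850  | 3        | 1.5       | 1060        | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500  | 3        | 1         | 1780        | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000  | 3        | 2.5       | 1890        | 6560     | 2      | 0          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+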

[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring the data for housing sales

The house price is correlated with the number of square feet of living space.



In [3]:

    
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")
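
To put a number on the correlation claim, here is a minimal sketch in plain Python that computes the Pearson correlation coefficient from only the ten rows printed in the table above. The helper name pearson is our own and is not part of GraphLab Create, and ten rows are far too few to describe the full dataset, so treat the result as an illustration only.

```python
import math

# sqft_living and price for the ten rows printed in the table above.
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
price = [221900, 538000, 180000, 604000, 510000,
         1225000, 257500, 291850, 229500, 323000]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(sqft_living, price)
print(r)
```

A value close to 1 indicates a strong positive linear relationship between the two columns.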

Create a simple regression model of sqft_living to price

Split data into training and testing.
We use seed=0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).



In [54]:

    
train_data,test_data = sales.random_split(.8,seed=0)
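
Why a fixed seed gives everyone the same split can be sketched in plain Python. This is not GraphLab Create's implementation; the function name random_split_sketch and the row-by-row assignment are assumptions made for illustration.

```python
import random

def random_split_sketch(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`.

    A seeded generator makes the assignment repeatable.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a table
train_a, test_a = random_split_sketch(rows, 0.8, seed=0)
train_b, test_b = random_split_sketch(rows, 0.8, seed=0)

print(train_a == train_b)            # same seed, same split
print(len(train_a) + len(test_a))    # every row lands in exactly one part
```

With this approach the training part holds roughly, not exactly, 80 percent of the rows.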

Build the regression model using only sqft_living as a feature



In [8]:

    
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16521
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.005403     | 4305232.665991     | 4382433.056547       | 260465.460106 | 306864.138614   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
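
The log above shows that GraphLab Create fits the model with Newton's method. As a rough illustration of what fitting a line of price against sqft_living means, the sketch below uses the textbook closed-form least-squares formulas instead, applied to only the ten rows printed earlier. The function name fit_simple_ols is our own, and the resulting numbers are not expected to match the model trained on the full training set.

```python
# sqft_living and price for the ten rows printed near the top of the notebook.
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
price = [221900, 538000, 180000, 604000, 510000,
         1225000, 257500, 291850, 229500, 323000]

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_simple_ols(sqft_living, price)
print(intercept)
print(slope)
```

The fitted line always passes through the point (mean sqft_living, mean price), which is a quick sanity check on the result.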

Evaluate the simple model



In [8]:

    
print test_data['price'].mean()









    



543054.042563



In [9]:

    
print sqft_model.evaluate(test_data)









    



{'max_error': 4153115.4452328016, 'rmse': 255169.84496896976}

RMSE of about \$255,170!
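
The two metrics returned by evaluate can be sketched in plain Python: RMSE is the square root of the mean squared error, and max_error is the largest absolute error. The three prices and predictions below are made-up values chosen only to keep the arithmetic easy to follow; they do not come from the dataset.

```python
import math

# Hypothetical prices and predictions, for illustration only.
actual = [200000, 300000, 500000]
predicted = [210000, 280000, 540000]

errors = [p - a for p, a in zip(predicted, actual)]

# Root mean squared error: square, average, then take the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / float(len(errors)))

# Maximum absolute error over all examples.
max_error = max(abs(e) for e in errors)

print(rmse)
print(max_error)
```

Because the errors are squared before averaging, one badly mispredicted house pulls RMSE up more than several small misses do.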

Let's show what our predictions look like

Matplotlib is a Python plotting library that we can use to visualize the predictions. You can install it with:

pip install matplotlib



In [78]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [9]:

    
plt.plot(test_data['sqft_living'],test_data['price'],'.',
        test_data['sqft_living'],sqft_model.predict(test_data),'-')









    Out[9]:





[<matplotlib.lines.Line2D at 0x7ff849a98e90>,
 <matplotlib.lines.Line2D at 0x7ff849b4aed0>]

Above: blue dots are original data, green line is the prediction from the simple regression.

Below: we can view the learned regression coefficients.



In [12]:

    
sqft_model.get('coefficients')









    Out[12]:





    
        name
        index
        value
    
    
        (intercept)
        None
        -44850.1725885
    
    
        sqft_living
        None
        280.76185312
    

[2 rows x 3 columns]
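
The two coefficients define a straight line, so a prediction is just the intercept plus the slope times the living area. The sketch below applies the values printed above to a 2,400 square foot house, an input chosen here for illustration. The execution counts suggest the notebook cells were run out of order, so predictions printed elsewhere may come from a refit with slightly different coefficients.

```python
# Coefficients copied from the table above.
intercept = -44850.1725885
slope = 280.76185312   # dollars per square foot of living space

def predict_price(sqft_living):
    """Price predicted by the one-feature linear model."""
    return intercept + slope * sqft_living

print(predict_price(2400))
```

Read this way, each additional square foot of living space adds about \$281 to the predicted price.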

Explore other features in the data

To build a more elaborate model, we will explore using more features.



In [50]:

    
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']



In [11]:

    
sales[my_features].show()



In [12]:

    
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

Pull the bar at the bottom to view more of the data.

98039 is the most expensive zip code.

Build a regression model with more features



In [75]:

    
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features)









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16520
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.031962     | 3759581.734076     | 1436881.747477       | 182990.238290 | 161057.605191   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+



In [19]:

    
print my_features









    



['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Comparing the results of the simple model with the model that uses more features



In [20]:

    
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)









    



{'max_error': 4153115.4452328016, 'rmse': 255169.84496896976}
{'max_error': 3491283.748780556, 'rmse': 179507.57212792453}

The RMSE goes down from \$255,170 to \$179,508 with more features.
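
As a quick check of how large that improvement is, the following sketch computes the absolute and relative drop in RMSE from the two values printed above.

```python
# RMSE values copied from the evaluation output above.
rmse_simple = 255169.84496896976
rmse_more_features = 179507.57212792453

reduction = rmse_simple - rmse_more_features
relative_reduction = reduction / rmse_simple

print(reduction)
print(relative_reduction)
```

The relative reduction comes out a little under 30 percent.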

Apply learned models to predict prices of 3 houses

The first house we will use is considered an "average" house in Seattle.



In [85]:

    
house1 = sales[sales['id']=='5309101200']



In [86]:

    
print house1









    



+------------+---------------------------+--------+----------+-----------+-------------+
|     id     |            date           | price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+--------+----------+-----------+-------------+
| 5309101200 | 2014-06-05 00:00:00+00:00 | 620000 |    4     |    2.25   |     2400    |
+------------+---------------------------+--------+----------+-----------+-------------+
+----------+--------+------------+------+-----------+-------+------------+---------------+
| sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
+----------+--------+------------+------+-----------+-------+------------+---------------+
|   5350   |  1.5   |     0      |  0   |     4     |   7   |    1460    |      940      |
+----------+--------+------------+------+-----------+-------+------------+---------------+
+----------+--------------+---------+-------------+---------------+---------------+-----+
| yr_built | yr_renovated | zipcode |     lat     |      long     | sqft_living15 | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
|   1929   |      0       |  98117  | 47.67632376 | -122.37010126 |     1250.0    | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.



In [57]:

    
print house1['price']









    



[620000, ... ]



In [58]:

    
print sqft_model.predict(house1)









    



[627571.3060718876]



In [59]:

    
print my_features_model.predict(house1)









    



[720032.9838529036]

In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature. However, on average, the model with more features is better.

Prediction for a second, fancier house

We will now examine the predictions for a fancier house.



In [87]:

    
house2 = sales[sales['id']=='1925069082']



In [88]:

    
print house2









    



+------------+---------------------------+---------+----------+-----------+-------------+
|     id     |            date           |  price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+---------+----------+-----------+-------------+
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 |    5     |    4.25   |     4640    |
+------------+---------------------------+---------+----------+-----------+-------------+
+----------+--------+------------+------+-----------+-------+------------+---------------+
| sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
+----------+--------+------------+------+-----------+-------+------------+---------------+
|  22703   |   2    |     1      |  4   |     5     |   8   |    2860    |      1780     |
+----------+--------+------------+------+-----------+-------+------------+---------------+
+----------+--------------+---------+-------------+---------------+---------------+-----+
| yr_built | yr_renovated | zipcode |     lat     |      long     | sqft_living15 | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
|   1952   |      0       |  98052  | 47.63925783 | -122.09722322 |     3140.0    | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.



In [63]:

    
print sqft_model.predict(house2)









    



[1251984.801173909]



In [64]:

    
print my_features_model.predict(house2)









    



[1464737.34675824]

In this case, the model with more features provides a better prediction. This behavior is expected here, because this house is more differentiated by features that go beyond its square feet of living space, especially the fact that it's a waterfront house.
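
The comparison for the two houses can be made explicit by computing each model's absolute error against the sale price. The sketch below uses only the numbers printed above; the helper name closest_model is our own.

```python
# Sale prices and predictions copied from the outputs above.
actual = {'house1': 620000, 'house2': 2200000}
predictions = {
    'house1': {'sqft_model': 627571.3060718876,
               'my_features_model': 720032.9838529036},
    'house2': {'sqft_model': 1251984.801173909,
               'my_features_model': 1464737.34675824},
}

def closest_model(house):
    """Return the name of the model with the smallest absolute error, plus all errors."""
    errors = {name: abs(pred - actual[house])
              for name, pred in predictions[house].items()}
    return min(errors, key=errors.get), errors

for house in ['house1', 'house2']:
    best, errors = closest_model(house)
    print(house)
    print(best)
    print(errors)
```

For the first house the one-feature model is closer, and for the second house the model with more features is closer, although both still underestimate the second house by a wide margin.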

Last house, super fancy

Our last house is a very large one owned by a famous Seattleite.



In [65]:

    
bill_gates = {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}



In [89]:

    
print my_features_model.predict(graphlab.SFrame(bill_gates))









    



[13662489.551216738]

The model predicts a price of over $13M for this house! But we expect the house to cost much more. (There are very few samples in the dataset of houses that are this fancy, so we don't expect the model to capture a perfect prediction here.)

Average price in the most expensive zip code



In [14]:

    
hiZipcode = sales[sales['zipcode'] == '98039']



In [15]:

    
print hiZipcode









    



+------------+---------------------------+---------+----------+-----------+-------------+
|     id     |            date           |  price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+---------+----------+-----------+-------------+
| 3625049014 | 2014-08-29 00:00:00+00:00 | 2950000 |    4     |    3.5    |     4860    |
| 2540700110 | 2015-02-12 00:00:00+00:00 | 1905000 |    4     |    3.5    |     4210    |
| 3262300940 | 2014-11-07 00:00:00+00:00 |  875000 |    3     |     1     |     1220    |
| 3262300940 | 2015-02-10 00:00:00+00:00 |  940000 |    3     |     1     |     1220    |
| 6447300265 | 2014-10-14 00:00:00+00:00 | 4000000 |    4     |    5.5    |     7080    |
| 2470100110 | 2014-08-04 00:00:00+00:00 | 5570000 |    5     |    5.75   |     9200    |
| 2210500019 | 2015-03-24 00:00:00+00:00 |  937500 |    3     |     1     |     1320    |
| 6447300345 | 2015-04-06 00:00:00+00:00 | 1160000 |    4     |     3     |     2680    |
| 6447300225 | 2014-11-06 00:00:00+00:00 | 1880000 |    3     |    2.75   |     2620    |
| 2525049148 | 2014-10-07 00:00:00+00:00 | 3418800 |    5     |     5     |     5450    |
+------------+---------------------------+---------+----------+-----------+-------------+
+----------+--------+------------+------+-----------+-------+------------+---------------+
| sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
+----------+--------+------------+------+-----------+-------+------------+---------------+
|  23885   |   2    |     0      |  0   |     3     |   12  |    4860    |       0       |
|  18564   |   2    |     0      |  0   |     3     |   11  |    4210    |       0       |
|   8119   |   1    |     0      |  0   |     4     |   7   |    1220    |       0       |
|   8119   |   1    |     0      |  0   |     4     |   7   |    1220    |       0       |
|  16573   |   2    |     0      |  0   |     3     |   12  |    5760    |      1320     |
|  35069   |   2    |     0      |  0   |     3     |   13  |    6200    |      3000     |
|   8500   |   1    |     0      |  0   |     4     |   7   |    1320    |       0       |
|  15438   |   2    |     0      |  2   |     3     |   8   |    2680    |       0       |
|  17919   |   1    |     0      |  1   |     4     |   9   |    2620    |       0       |
|  20412   |   2    |     0      |  0   |     3     |   11  |    5450    |       0       |
+----------+--------+------------+------+-----------+-------+------------+---------------+
+----------+--------------+---------+-------------+---------------+---------------+-----+
| yr_built | yr_renovated | zipcode |     lat     |      long     | sqft_living15 | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
|   1996   |      0       |  98039  | 47.61717049 | -122.23040939 |     3580.0    | ... |
|   2001   |      0       |  98039  | 47.62060082 |  -122.2245047 |     3520.0    | ... |
|   1955   |      0       |  98039  | 47.63281908 | -122.23554392 |     1910.0    | ... |
|   1955   |      0       |  98039  | 47.63281908 | -122.23554392 |     1910.0    | ... |
|   2008   |      0       |  98039  | 47.61512031 | -122.22420058 |     3140.0    | ... |
|   2001   |      0       |  98039  | 47.62888314 | -122.23346379 |     3560.0    | ... |
|   1954   |      0       |  98039  | 47.61872888 | -122.22643371 |     2790.0    | ... |
|   1902   |     1956     |  98039  | 47.61089438 | -122.22582388 |     4480.0    | ... |
|   1949   |      0       |  98039  | 47.61435052 | -122.22772057 |     3400.0    | ... |
|   2014   |      0       |  98039  | 47.62087993 | -122.23726918 |     3160.0    | ... |
+----------+--------------+---------+-------------+---------------+---------------+-----+
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

Mean price in the most expensive zip code



In [19]:

    
hiZipcode['price'].mean()









    Out[19]:





2160606.5999999996
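
The filter-then-average pattern used above can be sketched without GraphLab Create. The five rows below are copied from tables printed earlier in the notebook, so the result covers only three houses in 98039 and will not match the mean over the whole zip code. The helper name mean_price is our own.

```python
# A handful of (zipcode, price) rows copied from the printed tables above.
rows = [
    {'zipcode': '98178', 'price': 221900},
    {'zipcode': '98125', 'price': 538000},
    {'zipcode': '98039', 'price': 2950000},
    {'zipcode': '98039', 'price': 1905000},
    {'zipcode': '98039', 'price': 875000},
]

def mean_price(rows, zipcode):
    """Average price of the rows whose zipcode matches (zipcodes are strings here)."""
    prices = [row['price'] for row in rows if row['zipcode'] == zipcode]
    return sum(prices) / float(len(prices))

print(mean_price(rows, '98039'))
```

Note that the zip code is compared as a string, as in the notebook's own filter.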

Filtering data / Houses with 2000 <= sqft_living <= 4000



In [41]:

    
## Houses with 2000 <= sqft_living <= 4000 (both bounds inclusive)
myHouses = sales[(sales['sqft_living'] >= 2000) & (sales['sqft_living'] <= 4000)]



In [42]:

    
myHouses.show(view='BoxWhisker Plot', x='price', y='sqft_living')

Counting the fraction of houses in the range 2000 <= sqft_living <= 4000



In [100]:

    
myhouses_count = len(myHouses['id'])
allhouses_count = len(sales['id'])
print myhouses_count
print allhouses_count



In [103]:

    
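## Fraction of houses in the selected range (float() avoids Python 2 integer division)
print myhouses_count / float(allhouses_count)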









    



0.426641373248
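
The same count-and-divide logic can be sketched in plain Python. The first part applies the inclusive range filter to the ten sqft_living values printed near the top of the notebook; the second part redoes the division with the counts 9221 and 21613 used above. Converting to float before dividing matters in Python 2, where dividing two integers truncates the result to 0.

```python
# sqft_living for the ten rows printed near the top of the notebook.
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]

# Inclusive range filter, mirroring the >= and <= comparisons in the notebook.
in_range = [s for s in sqft_living if 2000 <= s <= 4000]
fraction_sample = len(in_range) / float(len(sqft_living))
print(in_range)
print(fraction_sample)

# Counts taken from the notebook cell above.
myhouses_count = 9221
allhouses_count = 21613
fraction_full = myhouses_count / float(allhouses_count)
print(fraction_full)
```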

Building a regression model with advanced features



In [46]:

    
advanced_features = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house
'grade', # measure of quality of construction
'waterfront', # waterfront property
'view', # type of view
'sqft_above', # square feet above ground
'sqft_basement', # square feet in basement
'yr_built', # the year built
'yr_renovated', # the year renovated
'lat', 'long', # the lat-long of the parcel
'sqft_living15', # average sq.ft. of 15 nearest neighbors 
'sqft_lot15', # average lot size of 15 nearest neighbors 
]



In [51]:

    
## Show
sales[advanced_features].show()



In [69]:

    
# Building the advanced model
advanced_model = gl.linear_regression.create(train_data, target='price', features = advanced_features)









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16482
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.041939     | 3472672.013489     | 2352287.602095       | 154041.671914 | 166014.881217   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

Testing the advanced model



In [73]:

    
## Evaluate on the test data
print advanced_model.evaluate(test_data)









    



{'max_error': 3568266.879823326, 'rmse': 157229.23779267372}

Comparing the results of the my_features model with the advanced model



In [77]:

    
print advanced_model.evaluate(test_data)
print my_features_model.evaluate(test_data)









    



{'max_error': 3568266.879823326, 'rmse': 157229.23779267372}
{'max_error': 3493010.4960461315, 'rmse': 179583.27738771832}



In [83]:

    
print advanced_model.get('coefficients')









    



+-------------+-------+----------------+
|     name    | index |     value      |
+-------------+-------+----------------+
| (intercept) |  None | -3135128.00156 |
|   bedrooms  |   2   | 5158.44562239  |
|   bedrooms  |   4   | -20852.9554906 |
|   bedrooms  |   5   | -36514.3155097 |
|   bedrooms  |   1   | 22781.6357966  |
|   bedrooms  |   6   | -86100.8361738 |
|   bedrooms  |   7   | -239258.210931 |
|   bedrooms  |   8   | -127512.963347 |
|   bedrooms  |   0   | 36815.7845627  |
|   bedrooms  |   9   | -159561.91187  |
+-------------+-------+----------------+
[127 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Visualization of advanced model prediction



In [90]:

    
plt.plot(test_data['sqft_living'],test_data['price'],'o',
        test_data['sqft_living'], advanced_model.predict(test_data), '-')









    Out[90]:





[<matplotlib.lines.Line2D at 0x7ff83d0e27d0>,
 <matplotlib.lines.Line2D at 0x7ff83d0e2990>]

Predicting with advanced model



In [93]:

    
print house1['price']
print sqft_model.predict(house1)
print my_features_model.predict(house1)
print advanced_model.predict(house1)









    



[620000]
[627571.3060718876]
[718712.2048351087]
[631684.5039731786]

Predicting Bill Gates's house price



In [94]:

    
## using advanced_model
print advanced_model.predict(gl.SFrame(bill_gates))









    



[10548348.128938712]



In [95]:

    
## sqft_model 
print sqft_model.predict(gl.SFrame(bill_gates))









    



[13896358.076989843]



In [96]:

    
## my_features_model
print my_features_model.predict(gl.SFrame(bill_gates))









    



[13662489.551216738]


